
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
10.7.6 Better Optimization
A central optimization difficulty with RNNs regards the learning of long-term
dependencies (Hochreiter, 1991; Bengio et al., 1993, 1994). This difficulty has
been explained in detail in Section 8.2.6. The gist of the problem is that the
composition of the non-linear recurrence with itself over many time steps
yields a highly non-linear function whose derivatives (e.g. of the state at T
with respect to the state at t < T, i.e. the Jacobian matrix ∂s_T / ∂s_t) tend
to either vanish or explode as T − t increases, because this Jacobian is equal
to the product of the state transition Jacobian matrices ∂s_{t+1} / ∂s_t.
If it explodes, the parameter gradient ∇_θ L also explodes, yielding gradient-
based parameter updates that are poor. A simple but practical heuristic solution
to this problem is discussed in the next section (Sec. 10.7.7). However, as discussed
in Bengio et al. (1994), if the state transition Jacobian matrix has eigenvalues that
are larger than 1 in magnitude, then it can lead to “unstable” dynamics, in the
sense that a bit of information cannot be stored reliably for a long time in the
presence of input “noise”. Indeed, the state transition Jacobian matrix eigenvalues
indicate how a small change in some direction (the corresponding eigenvector) will
be expanded (if the eigenvalue is greater than 1) or contracted (if it is less than
1).
An interesting idea proposed in Martens and Sutskever (2011) is that at the
same time as first derivatives are becoming smaller in directions associated with
long-term effects, so may the higher derivatives. In particular, if we use a second-
order optimization method (such as the Hessian-free method of Martens and
Sutskever (2011)), then we could differentially treat different directions: divide
the small first derivative (gradient) by a small second derivative, while not scaling
up in the directions where the second derivative is large (and hopefully, the first
derivative as well). Whereas in the scalar case, if we add a large number and a
small number, the small number is “lost”, in the vector case, if we add a large
vector to a small vector, the information about the direction of the small
vector can still be recovered if we have access to information (such as in the
second-derivative matrix) that tells us how to rescale each direction appropriately.
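The rescaling idea can be sketched with a diagonal curvature approximation, which is a much cruder stand-in for the Hessian-free method discussed in the text; the function name and damping constant below are illustrative assumptions, not Martens and Sutskever's algorithm:

```python
def preconditioned_step(grad, curvature, lr=1.0, damping=1e-8):
    """Divide each gradient component by the matching (positive)
    curvature estimate, so that directions where both the first and
    second derivatives are small still receive an appreciable update.
    `damping` guards against division by near-zero curvature."""
    return [lr * g / (c + damping) for g, c in zip(grad, curvature)]

# A direction with a tiny gradient but correspondingly tiny curvature
# moves about as far as a direction where both are large:
step = preconditioned_step(grad=[1e-6, 1.0], curvature=[1e-6, 1.0])
```

In a full second-order method the curvature information comes from the (approximate) Hessian rather than a fixed per-coordinate estimate, but the principle is the same: directions are treated differentially according to their curvature, rather than uniformly by a single learning rate.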
One disadvantage of many second-order methods, including the Hessian-free
method, is that they tend to be geared towards “batch” training (or fairly large
minibatches) rather than “stochastic” updates (where only one example or a
small minibatch of examples is examined before a parameter update is made).
Although the experiments on recurrent networks applied to problems with long-
term dependencies showed very encouraging results in Martens and Sutskever
(2011), it was later shown that similar results could be obtained by much simpler
methods (Sutskever, 2012; Sutskever et al., 2013) involving better initialization,
a cheap surrogate to second-order optimization (a variant on the momentum